An architecture for Malay Tweet normalization

Authors

  • Mohammad Arshi Saloot
  • Norisma Idris
  • Rohana Mahmud
Abstract

Research in natural language processing has increasingly focused on normalizing Twitter messages. While several well-defined approaches have been proposed for English, the problem remains far from solved for other languages, such as Malay. In this paper, we propose an approach to normalizing Malay Twitter messages based on corpus-driven analysis. We present an architecture for Malay Tweet normalization that comprises seven main modules: (1) enhanced tokenization, (2) In-Vocabulary (IV) detection, (3) specialized dictionary query, (4) repeated letter elimination, (5) abbreviation adjusting, (6) English word translation, and (7) de-tokenization. A parallel Tweet dataset consisting of 9000 Malay Tweets is used in the development and testing stages. In the evaluation, the system achieves a BLEU score of 0.83, against a baseline BLEU score of 0.46. To compare the accuracy of the architecture with other statistical approaches, an SMT-like normalization system is implemented, trained, and evaluated on the identical parallel dataset. The experimental results show that the normalization system designed around the features of Malay Tweets achieves higher accuracy than the SMT-like system. © 2014 Elsevier Ltd. All rights reserved.
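The seven modules named in the abstract can be read as a sequential pipeline. The following is only a minimal sketch of that ordering, not the authors' implementation: the abstract does not describe how each module works internally, so every function name, regular expression, and word list below is a hypothetical stand-in chosen for illustration.

```python
import re

# Tiny, hypothetical word lists; the paper's actual resources are not given here.
IV_WORDS = {"saya", "tidak", "suka", "makan", "yang", "sangat"}
SPECIAL_DICT = {"x": "tidak", "sy": "saya", "tq": "terima kasih"}
ABBREVIATIONS = {"yg": "yang", "mkn": "makan"}
ENGLISH_TO_MALAY = {"like": "suka", "very": "sangat"}

# (1) Tokenization that keeps URLs, @mentions and #hashtags as single tokens.
TOKEN_RE = re.compile(r"https?://\S+|[@#]\w+|\w+|[^\w\s]")
REPEAT_RE = re.compile(r"(.)\1+")


def tokenize(text):
    return TOKEN_RE.findall(text)


def is_protected(token):
    """URLs, mentions, hashtags and punctuation are passed through untouched."""
    return token.startswith(("http://", "https://")) or not token[0].isalnum()


def normalize_token(token):
    if is_protected(token):
        return token
    word = token.lower()
    if word in IV_WORDS:                      # (2) in-vocabulary detection
        return word
    if word in SPECIAL_DICT:                  # (3) specialized dictionary query
        return SPECIAL_DICT[word]
    collapsed = REPEAT_RE.sub(r"\1", word)    # (4) repeated letter elimination
    if collapsed in IV_WORDS:
        return collapsed
    if collapsed in ABBREVIATIONS:            # (5) abbreviation adjusting
        return ABBREVIATIONS[collapsed]
    if collapsed in ENGLISH_TO_MALAY:         # (6) English word translation
        return ENGLISH_TO_MALAY[collapsed]
    return token                              # unknown tokens are left as-is


def detokenize(tokens):
    # (7) Join tokens, then reattach common punctuation to the preceding token.
    return re.sub(r"\s+([.,!?;:])", r"\1", " ".join(tokens))


def normalize(tweet):
    return detokenize([normalize_token(t) for t in tokenize(tweet)])


if __name__ == "__main__":
    print(normalize("Sy x sukaaa mkn!!! @ali"))
```

In this sketch, collapsing every run of repeated letters to a single letter is a deliberate simplification; it is applied only to out-of-vocabulary tokens, so words already found in the vocabulary are not altered by it.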

Similar resources

Weighted and Unweighted Transducers for Tweet Normalization

We present two simple finite-state transducer based strategies for tweet normalization. One relies on hand-written correction rules designed to capture commonly occurring misspellings and abbreviations, while the other tries to automatically induce an error model from a gold standard corpus of normalized tweets.

Tweet Normalization with Syllables

In this paper, we propose a syllable-based method for tweet normalization to study the cognitive process of non-standard word creation in social media. Assuming that syllable plays a fundamental role in forming the non-standard tweet words, we choose syllable as the basic unit and extend the conventional noisy channel model by incorporating the syllables to represent the word-to-word transition...

The TALP-UPC Approach to Tweet-Norm 2013

This paper describes the methodology used by the TALP-UPC team for the SEPLN 2013 shared task of tweet normalization (Tweet-Norm). The system uses a set of modules that propose different corrections for each out-of-vocabulary word. The final correction is chosen by weighted voting according to each module's accuracy.

Data-Driven Spelling Correction using Weighted Finite-State Methods

This paper presents two systems for spelling correction formulated as a sequence labeling task. One of the systems is an unstructured classifier and the other one is structured. Both systems are implemented using weighted finite-state methods. The structured system delivers state-of-the-art results on the task of tweet normalization when compared with the recent AliSeTra system introduced by Ege...

Isolated Malay Digit Recognition Using Pattern Recognition Fusion of Dynamic Time Warping and Hidden Markov Models

This paper presents a pattern recognition fusion method for isolated Malay digit recognition using Dynamic Time Warping (DTW) and Hidden Markov Model (HMM). The aim of the project is to increase the accuracy percentage of Malay speech recognition. This study proposes an algorithm for pattern recognition fusion of the recognition models. The endpoint detection, framing, normalization, Mel Fre...

Journal:
  • Inf. Process. Manage.

Volume 50

Publication date: 2014